Proteome comparison
BLAST matrix theory

Technical use

Local BLAST using blastall and formatdb

The NCBI BLAST webpage allows a query sequence to be compared against large databases using the BLAST algorithm. As with the proteome comparisons presented here, it is sometimes useful to compare a query sequence to a single genome or a small selection of genomes. CMG-biotools does not include a local version of the NCBI databases, but it does allow smaller databases to be constructed locally and searched.

Construct a database from a FASTA file (A.fsa):

    formatdb -i A.fsa -p T -t 1

Compare the sequences in a FASTA file (C.fsa) with the database (protein BLAST):

    /usr/biotools/blast/blastall -F 0 -p blastp -d A.fsa -e 1e-5 -m 7 < C.fsa > qC_dA.blastout

Be aware that the complete path to blastall must be given for this to work (as shown).

Post script modifications

Bounding box

If the names of the organisms reach outside the border of the picture, change the numbers in the %%BoundingBox field of the .ps file. Each of these numbers corresponds to a point in a coordinate system:

    %%BoundingBox: 0 0 2212 1110
    %%BoundingBox: llx lly urx ury

The numbers give the size of the picture as the coordinates llx lly urx ury: the lower left corner (x,y) followed by the upper right corner (x,y). Change the coordinates, save the .ps file and open the file again.
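The bounding box edit described above can also be scripted. The following is a minimal sketch, not part of CMG-biotools: the function name pad_bounding_box, the margin parameters and the sample header are made up for illustration, and it assumes the file carries a single %%BoundingBox line with four integer coordinates.

```python
import re

# Matches a %%BoundingBox comment with four integer coordinates on one line.
BBOX = re.compile(
    r"^%%BoundingBox:[ \t]*(-?\d+)[ \t]+(-?\d+)[ \t]+(-?\d+)[ \t]+(-?\d+)[ \t]*$",
    re.MULTILINE,
)


def pad_bounding_box(ps_text, left=0, bottom=0, right=0, top=0):
    """Return ps_text with the first %%BoundingBox enlarged by the margins.

    The lower left corner moves down/left, the upper right corner moves
    up/right. Text without a matching line is returned unchanged.
    """
    def grow(match):
        llx, lly, urx, ury = (int(v) for v in match.groups())
        return "%%BoundingBox: {} {} {} {}".format(
            llx - left, lly - bottom, urx + right, ury + top
        )

    return BBOX.sub(grow, ps_text, count=1)


# Made-up three-line sample; only the BoundingBox values come from the text.
sample = "%!PS-Adobe-3.0\n%%BoundingBox: 0 0 2212 1110\n%%EndComments\n"
print(pad_bounding_box(sample, right=200, top=100))
```

Which edge needs the extra room depends on where the names are cut off, so the margins usually have to be found by trial and error, as with the manual edit.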
Hash lines in matrix post script file

If the matrix has a line for each organism saying something like

    (HASH\(0x954dba8\)) -45 rotate dup stringwidth pop neg 0 rmoveto show 45 rotate

it can be removed by deleting the entire block for that element of the picture:

    183.497474683058 ux 111.639610306789 uy moveto
    (HASH\(0x954dba8\)) -45 rotate dup stringwidth pop neg 0 rmoveto show 45 rotate
    newpath

This can be done for all entries in the .ps file as follows (the command drops every line containing HASH together with the line before it and the line after it):

    awk '/HASH/{c=1;next} c--<0 && p{print p} {p=$0} END{print p} ' test.ps > new.ps

Pan- and core-genome plot

The pan- and core-genome plot method looks at the cumulative set of all genes found across the genomes (pan-genome) and the set of gene families conserved across all genomes (core-genome). The pan- and core-genomes are theoretical representations of a collective protein pool and a conserved protein pool, respectively. When a protein type is found in all genomes in a collection, it is called a core gene of this collection.

Here this is implemented in a pan- and core-genome plot where sequences are compared using BLAST and the 50/50 % cutoff (50% identity over 50% coverage of the longest gene in the comparison). As the clusters grow to more than two members, single linkage clustering is used to assign a new sequence to a group.

The program performing this analysis is called pancoreplot, and the input is a tab separated text file listing a number of FASTA files containing amino acid sequences. This program takes all the protein FASTA files in a given directory. The protein FASTA file can be obtained by extracting proteins from a GenBank file (using saco extract) or by using the Prodigal genefinder (extract DNA from GenBank, saco convert, and find genes using prodigalrunner).

For the first genome, the pan and core are identical. The core becomes smaller with the addition of a second genome, as genes in this pool now need to be found in both genomes.
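As a rough illustration of this bookkeeping, the sketch below applies a 50/50 test, groups genes by single linkage, and counts pan and core families genome by genome. It is not the pancoreplot code: all function names and the toy genes are invented, and treating the cutoff as "at least 50%" (rather than "more than 50%") is an assumption.

```python
def passes_cutoff(identity_pct, aligned_len, len_a, len_b):
    """50/50 rule: >= 50% identity over >= 50% of the longer gene (assumed >=)."""
    return identity_pct >= 50.0 and aligned_len >= 0.5 * max(len_a, len_b)


def single_linkage(genes, hits):
    """Map each gene to a family label; any chain of hits joins two genes."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in hits:
        parent[find(a)] = find(b)
    return {g: find(g) for g in genes}


def families_per_genome(gene_to_genome, family_of):
    """Collect the set of family labels present in each genome."""
    out = {}
    for gene, genome in gene_to_genome.items():
        out.setdefault(genome, set()).add(family_of[gene])
    return out


def pan_core_curve(genomes):
    """genomes: list of (name, set of family labels) in plotting order."""
    pan, core, rows = set(), None, []
    for name, families in genomes:
        pan |= families                      # pan only grows
        core = set(families) if core is None else core & families
        rows.append((name, len(pan), len(core)))
    return rows


# Toy data: gene -> genome, plus pairs that are taken to pass the cutoff.
gene_to_genome = {"A1": "A", "A2": "A", "A3": "A",
                  "B1": "B", "B2": "B", "C1": "C", "C2": "C"}
hits = [("A1", "B1"), ("B1", "C1"), ("A2", "B2")]

family_of = single_linkage(gene_to_genome, hits)
by_genome = families_per_genome(gene_to_genome, family_of)
curve = pan_core_curve([(g, by_genome[g]) for g in ["A", "B", "C"]])
for name, pan_size, core_size in curve:
    print(name, pan_size, core_size)
```

In this toy set A1, B1 and C1 end up in one family even though A1 and C1 were never compared directly, which is the effect of single linkage. Because the core is a set intersection and the pan a set union, the last row has the same counts whatever order the genomes are added in; only the intermediate rows change.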
If a gene from the core is not found in a new genome, it is removed from the core and is then only part of the pan-genome. The pan-genome is the entire gene pool and as such includes the core genome. The order of the genomes can change the course of the graph, but the final shared gene pool (core) will be the same.

The program takes an input file with two columns separated by a tab. The first column is a genome description and cannot contain spaces or tabs. The second column is the file name, if the script is run in the same directory as the file:

    # Description  Filename
    CNRZ1066       Streptococcus_thermophilus_CNRZ1066_ID_58221_prodigal.orf.fsa
    JIM_8232       Streptococcus_thermophilus_JIM_8232_ID_68521_prodigal.orf.fsa
    LMD-9          Streptococcus_thermophilus_LMD-9_ID_13773_prodigal.orf.fsa
    LMG_18311      Streptococcus_thermophilus_LMG_18311_ID_58219_prodigal.orf.fsa
    ND03           Streptococcus_thermophilus_ND03_ID_49149_prodigal.orf.fsa

or a complete file path, if the script is run in some other directory:

    # Description  File_path
    CNRZ1066       /home/student/Strept_thermophilus_CNRZ1066_ID_58221_prodigal.orf.fsa
    JIM_8232       /home/student/Strep_thermophilus_JIM_8232_ID_68521_prodigal.orf.fsa
    LMD-9          /home/student/Strep_thermophilus_LMD-9_ID_13773_prodigal.orf.fsa
    LMG_18311      /home/student/Strept_thermophilus_LMG_18311_ID_58219_prodigal.orf.fsa
    ND03           /home/student/Strep_thermophilus_ND03_ID_49149_prodigal.orf.fsa

The input file can be made manually or using this shortcut, which uses the file name as the description as well:

    ls -1 *orf.fsa | gawk '{print $1 "\t" $1}' > pancore.list

The program writes the postscript file to the screen if the output is not redirected using ">".
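The list file described above can also be built and checked with a short script. This is only a sketch: the function names are invented, the file name doubles as the description exactly as in the ls/gawk shortcut, and skipping lines that start with "#" assumes the header line is treated as a comment, which the text does not state.

```python
import glob
import os


def find_fasta(directory=".", pattern="*orf.fsa"):
    """Return the base names of the FASTA files matching pattern."""
    return [os.path.basename(p)
            for p in glob.glob(os.path.join(directory, pattern))]


def list_lines(filenames):
    """One 'description<TAB>filename' line per file, sorted by name."""
    return ["{0}\t{0}".format(name) for name in sorted(filenames)]


def check_list(lines):
    """Return (line number, message) pairs for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # ASSUMPTION: blank and '#' lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append((number, "expected two tab-separated columns"))
        elif not fields[0] or " " in fields[0]:
            problems.append((number, "description is empty or contains a space"))
    return problems


# Literal file names so the demo does not depend on the current directory.
demo = list_lines(["b_prodigal.orf.fsa", "a_prodigal.orf.fsa"])
print("\n".join(demo))
print(check_list(demo + ["LMD 9\tsome.orf.fsa"]))
```

To write a real list, the lines returned by list_lines(find_fasta()) can be joined with newlines and saved as pancore.list. Replacing the first column with shorter strain names by hand keeps the plot labels readable.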
The postscript file and a table of exact values can be found in the kept directory, in the files named "ps" and "tbl" (these are the full file names):

    pancoreplot pancore.list -keep blastOutPut > pancore.ps
    pancoreplot -keep blastOutPut pancore.list > test.ps

    student@student-VirtualBox:~$ ls -l blastOutPut/ps
    -rw-r--r-- 1 student student 5265 2012-05-11 16:28 blastOutPut/ps
    student@student-VirtualBox:~$ ls -l blastOutPut/tbl
    -rw-r--r-- 1 student student 240 2012-05-11 16:28 blastOutPut/tbl

Extract subsets of core and pan genomes

See this page: http://biotoolscmg.wikia.com/wiki/Subset_genes,_pan-_and_core_genomes